1 Introduction and set-up

This project is made to be read in html, so open the html file in your preferred webbrowser. As standard the code is hidden in this document, but you can show all by pressing the button “Code” in the top right of the document. You can also show individual chunks of code by pressing the buttons “Code” which are placed around in the document.

Link for google colab:

Link for github: https://github.com/DataEconomistDK/M2-Group-Assignment

In this project we will work with a dataset of 5.000 consumer reviews for a few Amazon electronic products like f. ex. Kindle. Data is collected between September 2017 and October 2018. This is a sample taken from Kaggle which is a part of a much bigger dataset available trough Datafiniti. The data can be collected from this link: https://www.kaggle.com/datafiniti/consumer-reviews-of-amazon-products?fbclid=IwAR1o_blPfHeBPmnUzAOW7Ct24L7fhbI3OGcbfaVgaDZENhVXwaCP4godKvQ#Datafiniti_Amazon_Consumer_Reviews_of_Amazon_Products.csv

Note there is 3 available dataset on kaggle, but the file used here is called “Datafiniti_Amazon_Consumer_Reviews_of_Amazon_Products”. The file is downloaded as is, and imported further below.

1.1 Loading packages

First i have some personal setup in my local R-Markdown on how i want to display warnings ect. And then i load my packages.

### Knitr options
knitr::opts_chunk$set(warning=FALSE,
                     message=FALSE,
                     fig.align="center"
                     )

options(warn=-1) # Hides all warnings, as the knitr options only work on local R-Markdown mode. 

Sys.setenv(LANG = "en")
# Packages

if (!require("pacman")) install.packages("pacman") # package for loading and checking packages :)
pacman::p_load(knitr, # For knitr to html
               rmarkdown, # For formatting the document
               tidyverse, # Standard datasciewnce toolkid (dplyr, ggplot2 et al.)
               data.table, # for reading in data ect. 
               magrittr,# For advanced piping (%>% et al.)
               igraph, # For network analysis
               tidygraph, # For tidy-style graph manipulation
               ggraph, # For ggplot2 style graph plotting
               Matrix, # For some matrix functionality
               ggforce, # Awesome plotting
               kableExtra, # Formatting for tables
               car, # recode functions 
               tidytext, # Structure text within tidyverse
               topicmodels, # For topic modelling
               tm, # text mining library
               quanteda, # for LSA (latent semantic analysis)
               uwot, # for UMAP
               dbscan, # for density based clustering
               SnowballC,
               textdata,
               wordcloud, 
               textstem, # for textstemming 
               tidyr,
               widyr,
               reshape2,
               quanteda,
               uwot,
               dbscan,
               plotly,
               rsample,
               glmnet,
               doMC,
               broom,
               yardstick
               )

# I set a seed for reproduciability
set.seed(123) # Have to be set every time a rng proces is being made. 

1.2 Loading and filtering data

Now we load the data we downloaded from kaggle. From this file we select the following variables:

  • id: An id number given to each review created by us corrensponding to the row number of the raw data.

  • name: The full name of the product

  • reviews.rating: The rating of the product on a scale from 1-5.

  • reviews.title: The title of the review, given by the customer.

  • reviews.text: The review text written by the customer.

data_raw <- read_csv("Datafiniti_Amazon_Consumer_Reviews_of_Amazon_Products.csv") %>% 
  select(name, reviews.rating, reviews.text, reviews.title) %>% 
  mutate(id = row_number())

As the data is very raw and messy we now do some cleaning. We remove everything that is not normal letters, such as punctuations, numbers, special characters ect, and changing all strings to lower case in the review text.

We will also do some lemmatization. The purpose of this is to not only analyze the exact word strings in the reviews, as this would include several possible forms of the words used. F. ex. think and thought. Instead we want to merge all possible forms of a word into it’s root word. Lemmatization try and do so, by using detailed dictionaries which the algorithm looks trough to link a given word string back to it’s root word. This is a more advanced method than stemming and should be beneficial in this report.

We here want to primarily work with tidy text, where there is one token per row. So new a clean and filtered dataset is created both with tokens and as normal dataframe with the review text.

tokens_clean <- data_raw %>% 
  unnest_tokens(word, reviews.text, to_lower = TRUE) %>% 
  mutate(word = word %>% str_remove_all("[^a-zA-Z]")) %>%
  filter(str_length(word) > 0) %>% 
  mutate(word = lemmatize_words(word))

reviewtext_lemma <- tokens_clean %>% 
  group_by(id) %>% 
  summarize(reviews.text = str_c(word, collapse = " ")) %>% 
  ungroup() %>% 
  select(reviews.text) %>% 
  as_vector()

data_clean <- data_raw %>%
  mutate(reviews.text = reviewtext_lemma)

We now have 153.994 tokens, in their each seperate rows in the tokens dataset. By doing lemmatization the number of unique tokens are reduced from around 6000 to around 4600 words, which should prove quite beneficial.

2 Network analysis

In this assignment we want to use network analysis to gain new insights into how the reviews are structured. Here we extract bigrams from each review text, clean and prepare them to then create networks. Where we before considered tokens as individual words, we can create them as n-grams that are a consecutive sequence of words. Bigrams are n-grams with a length of 2 consecutive words. This can be used to gain context and connection between words.

Bigrams are now created, by unnesting the tokens.

bigrams <- data_clean %>%
  unnest_tokens(bigram, reviews.text, token = "ngrams", n = 2) # n is the number of words to consider in each n-gram. 

bigrams$bigram[1:2]
## [1] "the display" "display be"

Remember that each bigram overlap, as can be seen from above, so that the first token is “the display” and the second is “display is”. Now the most common bigrams are displayed.

#Counting common bigrams
bigrams %>% 
  count(bigram, sort = TRUE)

Notice the most common bigrams are: “for my”, “easy to”, “to use”, “it is”. These are mostly stopwords, which is not very usefull for the analysis. To remove these from the bigrams, we now split the bigram into 2 columns word1 and word2, and then filter them away if either of them is a stopword. The stopwords are taken from a dictionary called stop_words. Now we make a new a new count to see the most bigrams after filtering.

bigrams_separated <- bigrams %>% 
  separate(bigram,c("word1","word2"),sep = " ")

bigrams_filtered <- bigrams_separated %>% 
  filter(!word1 %in% stop_words$word) %>% 
  filter(!word2 %in% stop_words$word)

#New bigram counts
bigram_counts <- bigrams_filtered %>% 
  count(word1, word2, sort = TRUE)

bigram_counts

Above we can see that the most common bigrams are now mostly product names such as “kindle fire”, “battery life”, “amazon fire”, “amazon echo” ect. We now combine the 2 columns again into a single column with the bigram, to do further analysis. This is done by using the ‘tidyr’ function ‘unite’. The purpose is to treat the bigram as a ‘term in a document’.

The interesting thing is now to visualize the relationship between all words. To this we will use the package igraph. Before doing this we will need to create the graph from a data frame of the bigrams. Where we take a data frame of edges with columns for ‘from’, ‘to’. Which is equal to n.

2.1 Comment more about network

bigram_graph <- bigram_counts %>% 
  filter(n > 15) %>%  #Sets the combination to atleast 15
  graph_from_data_frame()



a<- grid::arrow(type = "closed",length = unit(.15,"inches"))

ggraph(bigram_graph, layout = "fr") + 
  geom_edge_link(aes(edge_alpha = n), show.legend = FALSE,
                 arrow = a, end_cap = circle(.05,'inches'))+
  geom_node_point(color = "pink", size = 3) + 
  geom_node_text(aes(label = name),vjust=1,hjust=1) + 
  theme_void()

The plot above, show us some details about the structure of the text. We can for example see that such thing as ‘son’ ‘wife’ and ‘daugther’ are connected with ‘loves’.

3 Correlation bigrams

Check up on this!

#Correlation bigrams
bigram_section <- tokens_clean %>% 
  filter(!word %in% stop_words$word)

word_pairs <- bigram_section %>% 
  pairwise_count(word, id, sort = TRUE)

word_pairs
word_cors <- bigram_section %>% 
  group_by(word) %>% 
  filter(n()>= 20) %>% 
  pairwise_cor(word,id,sort=TRUE)

word_cors

Maybe choose other names

word_cors %>% 
  filter(item1 %in% c("kindle","fire")) %>% 
  group_by(item1) %>% 
  top_n(6) %>% 
  ungroup() %>% 
  mutate(item2 = reorder(item2, correlation)) %>% 
  ggplot(aes(item2, correlation)) +
  geom_bar(stat = "identity") +
  facet_wrap(~ item1, scales = "free") +
  coord_flip()

Maybe add colour scale.

word_cors %>% 
  filter(correlation > .275) %>% 
  graph_from_data_frame() %>% 
  ggraph(layout="fr") +
  geom_edge_link(aes(edge_alpha = correlation), show.legend = TRUE) +
  geom_node_point(color = "pink",size=3) +
  geom_node_text(aes(label = name), repel = TRUE) +
  theme_void()

4 NLP

In this section we will analyze the xxxx.

4.1 Data preprocessing

Up until now special characters, numbers and special letter have been removed and the tokens have been unnested. Words have also been through a lemmatazation. We will start to look at the top 100 words.

tokens_clean %>% count(word, sort=TRUE) %>% head(100)

Before looking for our own stopwords, we will move all stopwords build into the package, tidytext, called SMART.

tokens_clean %>% anti_join(stop_words)

After that we will look throuh the tokens_clean dataframe again and remove our own stopwords, where we decied to remove these five stopwords.

own_stopwords <- tibble(word= c("im", "ive", "dont", "doesnt", "didnt"), 
                        lexicon = "OWN")

Now we will remove out own stopwords. Afterwards we will filter first for ndoc, which is the total number of words in the document. Here we say that documents, here reviws, with less than five words in them.

tokens_stemmed <- tokens_clean %>% 
  anti_join(stop_words %>% bind_rows(own_stopwords), by = "word") %>% add_count(id, name = "ndoc") %>% filter(ndoc > 5) %>% select(-ndoc)

We will now again look at the top words and again plot them.

topwords <- tokens_stemmed %>% count(word, sort=TRUE)

topwords %>%
  top_n(20, n) %>%
  ggplot(aes(x = word %>% fct_reorder(n), y = n)) +
  geom_col() +
  coord_flip() +
  labs(title = "Word Counts", 
       x = "Frequency", 
       y = "Top Words")

And now we will look at a wordcloud for the top 50 words. So pretty.

wordcloud(topwords$word, topwords$n, random.order = FALSE, 
          max.words = 50, colors = brewer.pal(8,"Dark2"))

4.2 TF-IDF

Up untill now, equal weight have been given to all words, but some are more rare than others. Term frequency–inverse document frequency or just tf-idf, is a way to analyze how important a word is to a document in a corpus:

\[\text{tf-idf}(t, d) = \text{tf}(t, d) \times \text{idf}(t)\] Here tf is the term-frequency and idf is the inverse document-frequency, a coefficient which is larger whenever the particular term is found in a lesser number of documents.

We tried to run a tf-idf analysis but we couldn’t really say anything from the analysis, probably because there’s a lot of documents. Every person has their own dictionary and a lot of words may appear very rare, and therefor they may be giving a high idf coefficient, which is why their tf-idf is high. If we were analyzing a number of books, the analyses may have made more sense.

4.3 Sentiment analysis

Sentiment analysis refers to a use of text analysis to extract and identify subjective information, where it analyzises whether the words are positive or negative. In this section, we will be doing two sentiment analysis, first by identifying positive and negative words using the bing lexicon and after this using the afinn lexicon. Hereafter, we will analyzie the data using the afinn lexicon, which gives every word a score between -5 and 5.

4.3.1 General analysis

Before doing the sentiment analysis, we will quickly look a the distribution of the review ratings.

summary(tokens_stemmed$reviews.rating)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   4.000   5.000   4.515   5.000   5.000

Here, we can see that there is a overepresentation of positive reviews, where the mean is at 4.515 and the median at 5.00. This will contribute to how we do the rest of the sentiment analysis. There is 1134 one-star review rating, 584 two-star review rating, 3028 three-star review rating, 12070 four-star review rating and 30383 five-star review rating.

4.3.2 Bing

We wil start with the bing lexicon. The bing lexicon categorizes words in a binary fashion into positive and negative categories. Here, we are using the function get_sentiment to get a specific sentiment lexicon and inner_join to join the lexcon with tokenized data. After this we can count the sentiments.

sentiment_bing <- tokens_stemmed %>% inner_join(get_sentiments("bing"))
sentiment_bing %>% count(sentiment)

Here we can see that there is a lot more positive than negative words, which we also explained earlier was due to the fact that there is a lot more positive then negative reviews.

Now we wil try to plot the sentiments, here grouped by positive and negative sentiments. We are plotting a word count, grouped by sentiment, showing the 10 most frequent negative and positive words.

sentiment_analysis <- sentiment_bing %>% filter(sentiment %in% c("positive"
, "negative"))

word_counts <- sentiment_analysis %>%
count(word, sentiment) %>%
group_by(sentiment) %>%
top_n(10, n) %>%
ungroup() %>%
mutate(
word2 = fct_reorder(word, n))

ggplot(word_counts, aes(x = word2, y = n, fill = sentiment)) +
geom_col(show.legend = FALSE) +
facet_wrap(~ sentiment, scales ="free") +
coord_flip() +
labs(title ="Sentiment Word Counts",x ="Words")

Here we can see

And now we will count all positive and negative words for all five start reviews.

tokens_stemmed %>% inner_join(get_sentiments("bing")) %>% count(reviews.rating, sentiment)
tokens_stemmed_bing <- tokens_stemmed %>%
inner_join(get_sentiments("bing")) %>%
count(reviews.rating, sentiment) %>%
spread(sentiment, n) %>%
mutate(overall_sentiment = positive - negative)
ggplot(
tokens_stemmed_bing,
aes(x = reviews.rating, y = overall_sentiment, fill = as.factor(reviews.rating))
) +
geom_col(show.legend = FALSE) +
coord_flip() +
labs(
title =
"Overall Sentiment by Stars"
,
subtitle =
"Reviews for Robotic Vacuums"
,
x =
"Stars"
,
y =
"Overall Sentiment"
)

bing_word_counts <- tokens_stemmed %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>%
  ungroup()

bing_word_counts

4.3.3 Afinn

sentiment_afinn
ggplot(sentiment_afinn,aes(x = reviews.rating, y = sentiment, 
                           fill = as.factor(reviews.rating))) +
geom_col(show.legend = FALSE) +
coord_flip() +
labs(title ="Overall Sentiment by Stars",subtitle ="Reviews for Robotic Vacuums",
     x ="Stars",y ="Overall Sentiment")

labs(title ="Sentiment Word Counts",x ="Words")
## $x
## [1] "Words"
## 
## $title
## [1] "Sentiment Word Counts"
## 
## attr(,"class")
## [1] "labels"

4.4 LSA

#Document-feature-matrix
data_dfm = tokens_stemmed %>% count(id, word) %>% cast_dfm(document = id, term = word, value = n)
data_dfm
## Document-feature matrix of: 3,306 documents, 3,543 features (99.7% sparse).
data_lsa_loading <- data_dfm$docs %>%
  as.data.frame() %>%
  rownames_to_column(var = "id") %>% 
  as_tibble()
data_lsa_umap %<>% as.data.frame()
data_lsa_hdbscan <- data_lsa_umap %>% as.matrix() %>% hdbscan(minPts = 100)
x = data_lsa_umap %>% 
  bind_cols(cluster = data_lsa_hdbscan$cluster %>% as.factor(), 
            prob = data_lsa_hdbscan$membership_prob) %>%
  ggplot(aes(x = V1, y = V2, col = cluster)) + 
  geom_point(aes(alpha = prob), shape = 21) 

ggplotly(x)

4.5 LDA

pacman::p_load(lda, # For LDA-analysis
               topicmodels) # LDA models

The topicmodels package requires a document-term matrix as input: By using the function cast_dtm og tidytext we can easily produce it. The matrix have to

data_dtm <- tokens_stemmed %>%
  count(id, word) %>%
  cast_dtm(document = id, term = word, value = n, weighting = tm::weightTf)

data_dtm
## <<DocumentTermMatrix (documents: 3306, terms: 3543)>>
## Non-/sparse entries: 40576/11672582
## Sparsity           : 100%
## Maximal term length: 18
## Weighting          : term frequency (tf)
data_dtm %>% removeSparseTerms(sparse = .99)
## <<DocumentTermMatrix (documents: 3306, terms: 245)>>
## Non-/sparse entries: 26472/783498
## Sparsity           : 97%
## Maximal term length: 13
## Weighting          : term frequency (tf)
data_dtm %>% removeSparseTerms(sparse = .999)
## <<DocumentTermMatrix (documents: 3306, terms: 1224)>>
## Non-/sparse entries: 37198/4009346
## Sparsity           : 99%
## Maximal term length: 14
## Weighting          : term frequency (tf)
data_dtm %>% removeSparseTerms(sparse = .9999)
## <<DocumentTermMatrix (documents: 3306, terms: 3543)>>
## Non-/sparse entries: 40576/11672582
## Sparsity           : 100%
## Maximal term length: 18
## Weighting          : term frequency (tf)
data_lda <- data_dtm %>%
  LDA(k = 3, method = "Gibbs",
      control = list(seed = 1337))
lda_beta <- data_lda %>%
  tidy(matrix = "beta") %>%
  group_by(topic) %>%
  arrange(topic, desc(beta)) %>%
  slice(1:10) %>%
  ungroup()

lda_beta %>% head()
lda_beta %>%
  mutate(term = reorder_within(term, beta, topic)) %>%
  group_by(topic, term) %>%
  arrange(desc(beta)) %>%
  ungroup() %>%
  ggplot(aes(term, beta, fill = as.factor(topic))) +
  geom_col(show.legend = FALSE) +
  coord_flip() +
  scale_x_reordered() +
  labs(title = "Top 10 terms in each LDA topic",
       x = NULL, y = expression(beta)) +
  facet_wrap(~ topic, ncol = 2, scales = "free")

It seems like cluster 1 contains some words with a tecnological character (tablet, screen, devic, app, game) while cluster 2 seems related to books and reading (kindl, read, book). The last cluster, 3, contains some positive words (love and smart) as the only cluster - besides “easi” in cluster 2

5 Machine learning

data_split <- data_clean %>% 
  select(id) %>% 
  initial_split()

train_data <- training(data_split)
test_data <- testing(data_split)

Transforming training data to a sparse Matrix

sparse_words <- tokens_clean %>% 
  count(id,word) %>% 
  inner_join(train_data) %>% 
  cast_sparse(id,word)

class(sparse_words)
## [1] "dgCMatrix"
## attr(,"package")
## [1] "Matrix"
dim(sparse_words)
## [1] 3750 3677
word_rownames <- as.integer(rownames(sparse_words))


data_joined <- data_frame(id = word_rownames) %>% 
  left_join(data_clean %>% select(id, reviews.rating))

rating_equal_5 <- data_joined$reviews.rating == "5"

model <-cv.glmnet(sparse_words,rating_equal_5, family="binomial",
                  parallel = TRUE, keep = TRUE)

plot(model)

plot(model$glmnet.fit)

coefs <- model$glmnet.fit %>% 
  tidy() %>% 
  filter(lambda == model$lambda.1se)

coefs %>% 
  group_by(estimate > 0) %>% 
  top_n(10, abs(estimate)) %>% 
  ungroup() %>% 
  ggplot(aes(fct_reorder(term, estimate),estimate, fill =estimate > 0)) +
  geom_col(alpha = 0.8, show.legend = FALSE) + 
  coord_flip() + 
  labs(x = NULL, title = "SDS ER LIVET", subtitle = "OG I VED DET")

intercept <- coefs %>%
  filter(term == "(Intercept)") %>%
  pull(estimate)

classifications <- tokens_clean %>%
  inner_join(test_data) %>%
  inner_join(coefs, by = c("word" = "term")) %>%
  group_by(id) %>%
  summarize(score = sum(estimate)) %>%
  mutate(probability = plogis(intercept + score))

classifications